Module 7 - Recommender Systems.

Introduction:

Online e-commerce websites such as Amazon and Flipkart use different recommendation models to provide different suggestions to different users. Amazon currently uses item-to-item collaborative filtering, which scales to massive datasets and produces high-quality recommendations in real time.

Dataset:

Amazon Reviews data: For this case study, we are using the Electronics dataset.

Domain:

E-commerce

Attributes:

● userId : Every user is identified with a unique id

● productId : Every product is identified with a unique id

● Rating : The rating of the corresponding product by the corresponding user

● timestamp : Time of the rating (ignore this column for this exercise)

In [1]:
#import necessary libraries

from sklearn import preprocessing
import warnings 
warnings.filterwarnings('ignore')


#importing numerical library
import numpy as np

#To handle data in the form of rows and columns
import pandas as pd

#To enable plotting graphs in jupyter notebook
import matplotlib.pyplot as plt
%matplotlib inline 

#Importing library for statistical graphs
import seaborn as sns

#Importing sklearn function for splitting dataset into training and test set
from sklearn.model_selection import train_test_split

#Statistical library
from scipy.stats import norm

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import math
import json
import time
from sklearn.metrics.pairwise import cosine_similarity

#Use the standalone joblib package (sklearn.externals.joblib is deprecated in newer scikit-learn)
import joblib
import scipy.sparse
from scipy.sparse import csr_matrix
In [3]:
#importing Surprise library for collaborative filtering

from surprise import Dataset
#Importing KNN, SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore,
#BaselineOnly, CoClustering and cross_validate from the surprise library
#from surprise import KNNWithMeans
#from surprise import SVD
#from surprise import SVDpp
#from surprise import SlopeOne
#from surprise import NMF
#from surprise import NormalPredictor
#from surprise import KNNBaseline
#from surprise import KNNBasic
#from surprise import KNNWithMeans
#from surprise import KNNWithZScore
#from surprise import BaselineOnly
#from surprise import CoClustering
#from surprise.model_selection import cross_validate



#The surprise.accuracy module provides tools for computing accuracy metrics on a set of predictions.
from surprise import accuracy

#sklearn for feature extraction & modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

Deliverable 1. Read and explore the given dataset. (Rename columns/add headers, plot histograms, find data characteristics) - (2.5 Marks)

In [4]:
#Importing dataset using pandas dataframe function


original_df = pd.read_csv('ratings_Electronics (1).csv',names = ['user_id','product_id','ratings','Timestamp'])
original_df.shape
Out[4]:
(7824482, 4)

The dataset originally has no column headers, so I assigned names via the `names` argument of pandas' read_csv function. There are 7,824,482 rows and 4 columns.

In [5]:
#Let's have a look at the first few rows
original_df.head()
Out[5]:
user_id product_id ratings Timestamp
0 AKM1MP6P0OYPR 0132793040 5.0 1365811200
1 A2CX7LUOHB2NDG 0321732944 5.0 1341100800
2 A2NWSAGRHCP8N5 0439886341 1.0 1367193600
3 A2WNBOD3WNDNKT 0439886341 3.0 1374451200
4 A1GI0U4ZRJA8WN 0439886341 1.0 1334707200
In [8]:
#Let's have a look at the datatype of each column

original_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7824482 entries, 0 to 7824481
Data columns (total 4 columns):
user_id       object
product_id    object
ratings       float64
Timestamp     int64
dtypes: float64(1), int64(1), object(2)
memory usage: 238.8+ MB
In [9]:
# Summary statistics of 'rating' variable
original_df['ratings'].describe().transpose()
Out[9]:
count    7.824482e+06
mean     4.012337e+00
std      1.380910e+00
min      1.000000e+00
25%      3.000000e+00
50%      5.000000e+00
75%      5.000000e+00
max      5.000000e+00
Name: ratings, dtype: float64
In [7]:
#The following code drops the Timestamp column, as it is irrelevant for our recommender system

new_df = original_df.drop(['Timestamp'],axis = 1)
new_df.head()
Out[7]:
user_id product_id ratings
0 AKM1MP6P0OYPR 0132793040 5.0
1 A2CX7LUOHB2NDG 0321732944 5.0
2 A2NWSAGRHCP8N5 0439886341 1.0
3 A2WNBOD3WNDNKT 0439886341 3.0
4 A1GI0U4ZRJA8WN 0439886341 1.0
In [11]:
#Checking for the presence of missing values
new_df.isnull().sum()
Out[11]:
user_id       0
product_id    0
ratings       0
dtype: int64

Description - No null values found.

In [12]:
# find minimum and maximum ratings
print('The minimum rating is: %d' %(new_df['ratings'].min()))
print('The maximum rating is: %d' %(new_df['ratings'].max()))
The minimum rating is: 1
The maximum rating is: 5
In [13]:
# Check the distribution of ratings 
with sns.axes_style('white'):
    # Note: factorplot was renamed catplot in seaborn >= 0.9
    g = sns.factorplot("ratings", data=new_df, aspect=2.0, kind='count')
    g.set_ylabels("Total number of ratings") 
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x1b5e04ff518>

Description - The graph above shows that the ratings are heavily skewed toward 5 stars.

In [14]:
# Number of unique user id and product id in the data
print('Number of unique USERS in Raw data = ', new_df['user_id'].nunique())
print('Number of unique ITEMS in Raw data = ', new_df['product_id'].nunique())
Number of unique USERS in Raw data =  4201696
Number of unique ITEMS in Raw data =  476002
In [15]:
new_df['product_id'].nunique()
Out[15]:
476002

Visualization:

Ratings Distribution :

In [16]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)

data = new_df['ratings'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / new_df.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} Product-ratings'.format(new_df.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)

We can see that over 55% of all ratings in the data are 5, followed by 4, 1, 3, and 2. Products with low ratings were not liked by many users.

Ratings Distribution By Product:

In [17]:
# Number of ratings per product
data = new_df.groupby('product_id')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Product (Clipped at 50)',
                   xaxis = dict(title = 'Number of Ratings Per Product'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
In [18]:
new_df.groupby('product_id')['ratings'].count().reset_index().sort_values('ratings', ascending=False)[:10]
Out[18]:
product_id ratings
308398 B0074BW614 18244
429572 B00DR0PDNE 16454
327308 B007WTAJTO 14172
102804 B0019EHU8G 12285
296625 B006GWO5WK 12226
178601 B003ELYQGG 11617
178813 B003ES5ZUU 10276
323013 B007R5YDYA 9907
289775 B00622AG6S 9823
30276 B0002L5R78 9487

Description : The most rated product has received 18244 ratings.

Ratings Distribution By User :

In [19]:
# Number of ratings per user
data = new_df.groupby('user_id')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User (Clipped at 50)',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
In [20]:
most_rated = new_df.groupby('user_id')['ratings'].count().reset_index().sort_values('ratings', ascending=False)[:10]
most_rated
Out[20]:
user_id ratings
3263531 A5JLAU2ARJ0BO 520
3512451 ADLVFFE4VBT8 501
2989526 A3OXHLG6DIBRW8 498
3291008 A6FIAB28IS79 431
3284634 A680RUE1FDO8B 406
755206 A1ODOGXEYECQQ8 380
2424036 A36K2N527TXXJN 314
1451394 A2AY4YUOX2N1BQ 311
4100926 AWPODHOB4GFWL 308
1277963 A25C2M3QF9G7OQ 296

Description : The most active user has given 520 ratings.

Deliverable 2. Take a subset of the dataset to make it less sparse/denser (for example, keep only the users who have given 50 or more ratings). - (2.5 Marks)

In [11]:
#Prepare the data: keep only users who meet the minimum number of ratings (here, 50)
counts = new_df['user_id'].value_counts()
df_final = new_df[new_df['user_id'].isin(counts[counts >= 50].index)]

Calculate the density of the rating matrix

In [12]:
#Let's have a look at the final dataframe
df_final.head()
Out[12]:
user_id product_id ratings
94 A3BY5KCNQZXV5U 0594451647 5.0
118 AT09WGFUM934H 0594481813 3.0
177 A32HSNCNPRUMTR 0970407998 1.0
178 A17HMM1M7T9PJ1 0970407998 4.0
492 A3CLWR1UUZT6TG 0972683275 5.0
In [15]:
final_ratings_matrix = df_final.pivot(index = 'user_id', columns ='product_id', values = 'ratings').fillna(0)
print('Shape of final_ratings_matrix: ', final_ratings_matrix.shape)

given_num_of_ratings = np.count_nonzero(final_ratings_matrix)
print('given_num_of_ratings = ', given_num_of_ratings)
possible_num_of_ratings = final_ratings_matrix.shape[0] * final_ratings_matrix.shape[1]
print('possible_num_of_ratings = ', possible_num_of_ratings)
density = (given_num_of_ratings/possible_num_of_ratings)
density *= 100
print ('density: {:4.2f}%'.format(density))
Shape of final_ratings_matrix:  (1540, 48190)
given_num_of_ratings =  125871
possible_num_of_ratings =  74212600
density: 0.17%
In [16]:
# Matrix with one row per 'Product' and one column per 'user' for Item-based CF
final_ratings_matrix_T = final_ratings_matrix.transpose()
final_ratings_matrix_T.head()
Out[16]:
user_id A100UD67AHFODS A100WO06OQR8BQ A105S56ODHGJEK A105TOJ6LTVMBG A10AFVU66A79Y1 A10H24TDLK2VDP A10NMELR4KX0J6 A10O7THJ2O20AG A10PEXB6XAQ5XF A10X9ME6R66JDX ... AYOTEJ617O60K AYP0YPLSP9ISM AZ515FFZ7I2P7 AZ8XSDMIX04VJ AZAC8O310IK4E AZBXKUH4AIW3X AZCE11PSTCH1L AZMY6E8B52L2T AZNUHQSHZHSUE AZOK5STV85FBJ
product_id
0594451647 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0594481813 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0970407998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0972683275 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1400501466 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1540 columns
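The transposed matrix above has one row per product, which is the layout needed for item-based CF: two items are "similar" when their rating vectors across users point in similar directions. As a minimal sketch on a small hypothetical item-user matrix (not the actual data), using the cosine_similarity function imported earlier:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy item-user matrix: 3 items (rows) rated by 4 users (columns); 0 = no rating
item_user = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 5.0],
])

# Pairwise cosine similarity between item rating vectors
sim = cosine_similarity(item_user)

# Items 0 and 1 are rated highly by the same users, so they are far more
# similar to each other than either is to item 2
print(sim.shape)              # (3, 3)
print(sim[0, 1] > sim[0, 2])  # True
```

On the real data, the same call applied to final_ratings_matrix_T would give a product-by-product similarity matrix, though at 48,190 × 48,190 it would be memory-hungry in dense form.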

Deliverable 3. Split the data randomly into train and test datasets (for example, in a 70/30 ratio). - (2.5 Marks)

In [17]:
#Split the data randomly into train and test datasets
#in the ratio 70:30
train_data, test_data = train_test_split(df_final, test_size = 0.3, random_state=0)
train_data.head()
Out[17]:
user_id product_id ratings
6595853 A2BYV7S1QP2YIG B009EAHVTA 5.0
4738241 AB094YABX21WQ B0056XCEAA 1.0
4175596 A3D0UM4ZD2CMAW B004I763AW 5.0
3753016 AATWFX0ZZSE6C B0040NPHMO 3.0
1734767 A1NNMOD9H36Q8E B0015VW3BM 4.0
In [18]:
train_data.shape
test_data.shape
Out[18]:
(88109, 3)
Out[18]:
(37762, 3)

Deliverable 4. Build a Popularity Recommender model. - (20 Marks)

Approach 1: Using a user-defined function recommend.

In [19]:
#Count of user_id for each unique product as recommendation score 
train_data_grouped = train_data.groupby('product_id').agg({'user_id': 'count'}).reset_index()
train_data_grouped.rename(columns = {'user_id': 'score'},inplace=True)
train_data_grouped.head()
Out[19]:
product_id score
0 0594451647 1
1 0594481813 1
2 0970407998 1
3 0972683275 3
4 1400501466 4
In [20]:
#Sort the products on recommendation score 
train_data_sort = train_data_grouped.sort_values(['score', 'product_id'], ascending = [0,1]) 
      
#Generate a recommendation rank based upon score 
train_data_sort['Rank'] = train_data_sort['score'].rank(ascending=0, method='first') 
          
#Get the top 5 recommendations 
popularity_recommendations = train_data_sort.head(5) 
popularity_recommendations 
Out[20]:
product_id score Rank
30847 B0088CJT4U 133 1.0
30287 B007WTAJTO 124 2.0
19647 B003ES5ZUU 122 3.0
8752 B000N99BBC 114 4.0
30555 B00829THK0 97 5.0
In [21]:
# Use popularity based recommender model to make predictions
def recommend(user_id):     
    #Work on a copy so the global recommendation table is not modified in place
    user_recommendations = popularity_recommendations.copy() 
          
    #Add user_id column for which the recommendations are being generated 
    user_recommendations['userID'] = user_id 
      
    #Bring user_id column to the front 
    cols = user_recommendations.columns.tolist() 
    cols = cols[-1:] + cols[:-1] 
    user_recommendations = user_recommendations[cols] 
          
    return user_recommendations 
In [22]:
 # This list of user ids is arbitrary; any ids may be used.

find_recom = [10,22,53]   
for i in find_recom:
    print("Here is the recommendation for the userId: %d\n" %(i))
    print(recommend(i))    
    print("\n") 
Here is the recommendation for the userId: 10

       userID  product_id  score  Rank
30847      10  B0088CJT4U    133   1.0
30287      10  B007WTAJTO    124   2.0
19647      10  B003ES5ZUU    122   3.0
8752       10  B000N99BBC    114   4.0
30555      10  B00829THK0     97   5.0


Here is the recommendation for the userId: 22

       userID  product_id  score  Rank
30847      22  B0088CJT4U    133   1.0
30287      22  B007WTAJTO    124   2.0
19647      22  B003ES5ZUU    122   3.0
8752       22  B000N99BBC    114   4.0
30555      22  B00829THK0     97   5.0


Here is the recommendation for the userId: 53

       userID  product_id  score  Rank
30847      53  B0088CJT4U    133   1.0
30287      53  B007WTAJTO    124   2.0
19647      53  B003ES5ZUU    122   3.0
8752       53  B000N99BBC    114   4.0
30555      53  B00829THK0     97   5.0


Approach 2: By calculating mean ratings and rating counts.

In [23]:
df_final.groupby('product_id')['ratings'].mean().head() 
Out[23]:
product_id
0594451647    5.000000
0594481813    3.000000
0970407998    2.500000
0972683275    4.750000
1400501466    3.333333
Name: ratings, dtype: float64
In [24]:
df_final.groupby('product_id')['ratings'].mean().sort_values(ascending=False).head()  
Out[24]:
product_id
B00LKG1MC8    5.0
B002QUZM3M    5.0
B002QWNZHU    5.0
B002QXZPFE    5.0
B002R0DWNS    5.0
Name: ratings, dtype: float64
In [25]:
df_final.groupby('product_id')['ratings'].count().sort_values(ascending=False).head()  
Out[25]:
product_id
B0088CJT4U    206
B003ES5ZUU    184
B000N99BBC    167
B007WTAJTO    164
B00829TIEK    149
Name: ratings, dtype: int64
In [26]:
ratings_mean_count = pd.DataFrame(df_final.groupby('product_id')['ratings'].mean()) 
In [27]:
ratings_mean_count['rating_counts'] = pd.DataFrame(df_final.groupby('product_id')['ratings'].count())  
In [28]:
ratings_mean_count.head(10)  
Out[28]:
ratings rating_counts
product_id
0594451647 5.000000 1
0594481813 3.000000 1
0970407998 2.500000 2
0972683275 4.750000 4
1400501466 3.333333 6
1400501520 5.000000 1
1400501776 4.500000 2
1400532620 3.000000 2
1400532655 3.833333 6
140053271X 2.500000 2

Description : Since this system is based on popularity, it cannot predict ratings for a particular user. Both approaches above recommend products based on popularity alone, so the recommendations are identical for all users (i.e., they are not personalized).

Deliverable 5. Build Collaborative Filtering model. - (20 Marks) 

Approach 1: Model-based Collaborative Filtering: Singular Value Decomposition (the calculation uses a user-defined function)
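Before applying it to the full matrix below, the core idea can be sketched on a tiny hypothetical ratings matrix: factor the matrix into a small number of latent dimensions k with truncated SVD, then multiply the factors back together; the dense low-rank reconstruction supplies estimated ratings for cells that were originally zero (unrated).

```python
import numpy as np
from scipy.sparse.linalg import svds

# Toy user-item ratings matrix (0 = not rated); hypothetical values
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Truncated SVD with k latent factors (k must be < min(n_users, n_items))
U, sigma, Vt = svds(ratings, k=2)
sigma = np.diag(sigma)  # svds returns singular values as a 1-D array

# Low-rank reconstruction: a predicted rating for every user-item pair,
# including the pairs that were 0 in the input
predicted = U @ sigma @ Vt
print(predicted.shape)  # (4, 4)
```

The notebook cells below do exactly this, with k = 10, on the 1540 × 48190 pivot matrix.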

In [29]:
df_CF = pd.concat([train_data, test_data]).reset_index()
df_CF.head()
Out[29]:
index user_id product_id ratings
0 6595853 A2BYV7S1QP2YIG B009EAHVTA 5.0
1 4738241 AB094YABX21WQ B0056XCEAA 1.0
2 4175596 A3D0UM4ZD2CMAW B004I763AW 5.0
3 3753016 AATWFX0ZZSE6C B0040NPHMO 3.0
4 1734767 A1NNMOD9H36Q8E B0015VW3BM 4.0
In [30]:
#User-based Collaborative Filtering
# Matrix with row per 'user' and column per 'item' 
pivot_df = df_CF.pivot(index = 'user_id', columns ='product_id', values = 'ratings').fillna(0)
pivot_df.shape
pivot_df.head()
Out[30]:
(1540, 48190)
Out[30]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L5YZCCG B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8
user_id
A100UD67AHFODS 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A100WO06OQR8BQ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A105S56ODHGJEK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A105TOJ6LTVMBG 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A10AFVU66A79Y1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 48190 columns

In [31]:
 pivot_df['user_index'] = np.arange(0, pivot_df.shape[0], 1)
pivot_df.head()
Out[31]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8 user_index
user_id
A100UD67AHFODS 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
A100WO06OQR8BQ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1
A105S56ODHGJEK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2
A105TOJ6LTVMBG 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3
A10AFVU66A79Y1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4

5 rows × 48191 columns

In [32]:
pivot_df.set_index(['user_index'], inplace=True)

# Actual ratings given by users
pivot_df.head()
Out[32]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L5YZCCG B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8
user_index
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 48190 columns

In [33]:
# SVD method
from scipy.sparse.linalg import svds
# Singular Value Decomposition
U, sigma, Vt = svds(pivot_df, k = 10)
# Construct diagonal array in SVD
sigma = np.diag(sigma)
In [34]:
all_user_predicted_ratings = np.dot(np.dot(U, sigma), Vt) 

# Predicted ratings
preds_df = pd.DataFrame(all_user_predicted_ratings, columns = pivot_df.columns)
preds_df.head()
Out[34]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L5YZCCG B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8
0 0.002661 0.003576 0.004050 0.006906 0.003967 0.003073 0.005782 0.000568 0.014386 0.002708 ... 6.108890e-04 0.044224 0.002919 0.060347 -0.002137 0.006751 0.001525 0.130951 0.059243 0.015014
1 0.002262 0.002505 0.005136 0.016517 0.007120 0.001438 0.013258 0.000335 -0.003781 0.001190 ... 2.024793e-04 0.029213 0.000010 0.000244 -0.003111 -0.000621 0.004409 -0.039241 -0.006889 0.003696
2 -0.001600 -0.002502 0.002186 0.016742 0.006716 -0.002113 0.005805 0.003497 -0.005009 -0.001588 ... -3.240446e-04 0.009180 -0.002459 -0.016922 0.019936 -0.002483 -0.000155 -0.002889 -0.011522 -0.004525
3 0.002732 0.003867 0.001799 0.009395 0.004075 0.002778 0.003507 0.000095 0.007983 0.002381 ... 6.031462e-04 -0.003369 0.003433 -0.003428 -0.000750 0.000119 0.002612 -0.015107 -0.006740 0.003276
4 0.000704 0.000085 0.002051 0.009664 0.004438 0.000335 0.005992 0.001056 -0.000369 0.000373 ... 3.745108e-08 -0.001140 -0.000323 -0.025215 0.004700 -0.002170 0.001263 -0.048555 -0.016301 -0.003377

5 rows × 48190 columns

In [41]:
# Recommend the items with the highest predicted ratings

def recommend_items(userID, pivot_df, preds_df, num_recommendations):
      
    user_idx = userID-1 # index starts at 0
    
    # Get and sort the user's ratings
    sorted_user_ratings = pivot_df.iloc[user_idx].sort_values(ascending=False)
    #sorted_user_ratings
    sorted_user_predictions = preds_df.iloc[user_idx].sort_values(ascending=False)
    #sorted_user_predictions

    temp = pd.concat([sorted_user_ratings, sorted_user_predictions], axis=1)
    temp.index.name = 'Recommended Items'
    temp.columns = ['user_ratings', 'user_predictions']
    
    temp = temp.loc[temp.user_ratings == 0]   
    temp = temp.sort_values('user_predictions', ascending=False)
    print('\nBelow are the recommended items for user(user_id = {}):\n'.format(userID))
    print(temp.head(num_recommendations))
In [39]:
#Enter 'userID' and 'num_recommendations' for the user #
userID = 12
num_recommendations = 5
recommend_items(userID, pivot_df, preds_df, num_recommendations)
Below are the recommended items for user(user_id = 12):

                   user_ratings  user_predictions
Recommended Items                                
B0088CJT4U                  0.0          1.639356
B000N99BBC                  0.0          1.328095
B00829TIEK                  0.0          1.269074
B008DWCRQW                  0.0          1.145459
B004CLYEDC                  0.0          1.125218
In [37]:
# Actual ratings given by the users
final_ratings_matrix.head()
Out[37]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L5YZCCG B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8
user_id
A100UD67AHFODS 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A100WO06OQR8BQ 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A105S56ODHGJEK 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A105TOJ6LTVMBG 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
A10AFVU66A79Y1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 48190 columns

In [42]:
# Average ACTUAL rating for each item
final_ratings_matrix.mean().head()
Out[42]:
product_id
0594451647    0.003247
0594481813    0.001948
0970407998    0.003247
0972683275    0.012338
1400501466    0.012987
dtype: float64
In [43]:
# Predicted ratings 
preds_df.head()
Out[43]:
product_id 0594451647 0594481813 0970407998 0972683275 1400501466 1400501520 1400501776 1400532620 1400532655 140053271X ... B00L5YZCCG B00L8I6SFY B00L8QCVL6 B00LA6T0LS B00LBZ1Z7K B00LED02VY B00LGN7Y3G B00LGQ6HL8 B00LI4ZZO8 B00LKG1MC8
0 0.002661 0.003576 0.004050 0.006906 0.003967 0.003073 0.005782 0.000568 0.014386 0.002708 ... 6.108890e-04 0.044224 0.002919 0.060347 -0.002137 0.006751 0.001525 0.130951 0.059243 0.015014
1 0.002262 0.002505 0.005136 0.016517 0.007120 0.001438 0.013258 0.000335 -0.003781 0.001190 ... 2.024793e-04 0.029213 0.000010 0.000244 -0.003111 -0.000621 0.004409 -0.039241 -0.006889 0.003696
2 -0.001600 -0.002502 0.002186 0.016742 0.006716 -0.002113 0.005805 0.003497 -0.005009 -0.001588 ... -3.240446e-04 0.009180 -0.002459 -0.016922 0.019936 -0.002483 -0.000155 -0.002889 -0.011522 -0.004525
3 0.002732 0.003867 0.001799 0.009395 0.004075 0.002778 0.003507 0.000095 0.007983 0.002381 ... 6.031462e-04 -0.003369 0.003433 -0.003428 -0.000750 0.000119 0.002612 -0.015107 -0.006740 0.003276
4 0.000704 0.000085 0.002051 0.009664 0.004438 0.000335 0.005992 0.001056 -0.000369 0.000373 ... 3.745108e-08 -0.001140 -0.000323 -0.025215 0.004700 -0.002170 0.001263 -0.048555 -0.016301 -0.003377

5 rows × 48190 columns

In [44]:
# Average PREDICTED rating for each product
preds_df.mean().head()
Out[44]:
product_id
0594451647    0.001542
0594481813    0.002341
0970407998    0.002597
0972683275    0.011807
1400501466    0.004848
dtype: float64
In [45]:
rmse_df = pd.concat([final_ratings_matrix.mean(), preds_df.mean()], axis=1)
rmse_df.columns = ['Avg_actual_ratings', 'Avg_predicted_ratings']
print(rmse_df.shape)
rmse_df['item_index'] = np.arange(0, rmse_df.shape[0], 1)
rmse_df.head()
(48190, 2)
Out[45]:
Avg_actual_ratings Avg_predicted_ratings item_index
product_id
0594451647 0.003247 0.001542 0
0594481813 0.001948 0.002341 1
0970407998 0.003247 0.002597 2
0972683275 0.012338 0.011807 3
1400501466 0.012987 0.004848 4
In [46]:
#accuracy check based on RMSe 

RMSE = round((((rmse_df.Avg_actual_ratings - rmse_df.Avg_predicted_ratings) ** 2).mean() ** 0.5), 5)
print('\nRMSE SVD Model = {} \n'.format(RMSE))
RMSE SVD Model = 0.0033 

Approach 2: Using the Surprise library:

To load a dataset from a pandas dataframe, we use the load_from_df() method. We also need a Reader object, and its rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in that order; each row then corresponds to one rating.

With the Surprise library, we will benchmark the following algorithms:

Basic algorithms

NormalPredictor

The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms and does very little work.

BaselineOnly

The BaselineOnly algorithm predicts the baseline estimate for a given user and item.

k-NN algorithms

KNNBasic

KNNBasic is a basic collaborative filtering algorithm.

KNNWithMeans

KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

KNNWithZScore

KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

KNNBaseline

KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.

Matrix Factorization-based algorithms

SVD

The SVD algorithm is equivalent to Probabilistic Matrix Factorization.

SVDpp

The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

NMF

NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

Slope One

Slope One is a straightforward implementation of the SlopeOne algorithm.

Co-clustering

Co-clustering is a collaborative filtering algorithm based on co-clustering.
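As a rough illustration of the baseline estimate that BaselineOnly and KNNBaseline build on, the prediction is r̂(u, i) = μ + b(u) + b(i), where μ is the global mean and b(u), b(i) are user and item biases. The sketch below uses plain deviations from the mean on hypothetical data; note that Surprise actually fits the biases with regularized ALS or SGD, so this is a simplification.

```python
import pandas as pd

# Toy ratings (hypothetical users/items, for illustration only)
df = pd.DataFrame({
    'user_id':    ['u1', 'u1', 'u2', 'u2', 'u3'],
    'product_id': ['p1', 'p2', 'p1', 'p3', 'p2'],
    'ratings':    [5.0,  3.0,  4.0,  2.0,  4.0],
})

mu = df['ratings'].mean()                              # global mean (3.6)
b_u = df.groupby('user_id')['ratings'].mean() - mu     # user bias
b_i = df.groupby('product_id')['ratings'].mean() - mu  # item bias

def baseline_estimate(user, item):
    # r_hat = mu + b_u + b_i; unseen users/items fall back to bias 0
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

print(round(baseline_estimate('u3', 'p1'), 2))  # → 4.9
```

u3 rates slightly above average (+0.4) and p1 is rated well above average (+0.9), so the baseline estimate for the unseen pair (u3, p1) lands at 3.6 + 0.4 + 0.9 = 4.9.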

In [47]:
from surprise import accuracy
from surprise import Reader
from surprise.model_selection import train_test_split
In [48]:
reader = Reader(rating_scale=(1, 5))
In [26]:
# Load the dataset from the dataframe: columns user_id, product_id, ratings
data1 = Dataset.load_from_df(df_final[['user_id', 'product_id', 'ratings']], reader)
In [3]:
#benchmark = []
# Iterate over all algorithms (kept commented out: runs out of memory on this dataset)
#for algorithm in [SVD(), SVDpp(), SlopeOne(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:
    # Perform cross validation
    #results = cross_validate(algorithm, data1, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    #tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    #tmp = tmp.append(pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm']))
    #benchmark.append(tmp)

#surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
#surprise_results

Description : The code above builds collaborative filtering models from the algorithms available in the Surprise library. It uses a for loop with the cross_validate function to evaluate each algorithm and rank the results by test_rmse. Note: the code is left commented out because it runs out of memory in the Jupyter notebook due to RAM limits.

Deliverable 6. Evaluate both the models. (Once the model is trained on the training data, it can be used to compute the error (RMSE) on predictions made on the test data.) - (7.5 Marks)

In [49]:
RMSE = round((((rmse_df.Avg_actual_ratings - rmse_df.Avg_predicted_ratings) ** 2).mean() ** 0.5), 5)
print('\nRMSE SVD Model = {} \n'.format(RMSE))
RMSE SVD Model = 0.0033 

Description : There is no direct way to compute an accuracy score for the popularity-based recommender system, because it is not tied to a specific user; it can only display products by popularity. The collaborative filtering model, on the other hand, achieves an RMSE of 0.0033 (on the average per-item ratings), which indicates a good model.

Deliverable 7. Get top-K (K = 5) recommendations. Since our goal is to recommend new products to each user based on his/her habits, we will recommend 5 new products. - (7.5 Marks)

In [50]:
# Enter 'userID' and 'num_recommendations' for the user #
userID = 55
num_recommendations = 5
recommend_items(userID, pivot_df, preds_df, num_recommendations)
Below are the recommended items for user(user_id = 55):

                   user_ratings  user_predictions
Recommended Items                                
B007WTAJTO                  0.0          0.756133
B003ES5ZUU                  0.0          0.528416
B002V88HFE                  0.0          0.415398
B001TH7GUU                  0.0          0.338206
B000QUUFRW                  0.0          0.331853
In [51]:
# Enter 'userID' and 'num_recommendations' for the user #
userID = 80
num_recommendations = 5
recommend_items(userID, pivot_df, preds_df, num_recommendations)
Below are the recommended items for user(user_id = 80):

                   user_ratings  user_predictions
Recommended Items                                
B007WTAJTO                  0.0          0.083453
B003ES5ZUU                  0.0          0.074858
B002V88HFE                  0.0          0.046475
B001TH7GUU                  0.0          0.043768
B000QUUFRW                  0.0          0.038911

Deliverable 8: Summarise your insights. - (7.5 marks) 

Summary:

  • I have built two types of recommender systems using two different approaches: a popularity-based recommender system and a collaborative filtering model.

  • In the first model, I used two approaches: a user-defined function and mean rating counts. Both do the same work in different ways and give identical recommendations.

  • In the second model, I built the collaborative filtering model in two different ways: in the first approach I wrote a function for SVD, while in the second I used the Surprise library and its supporting collaborative filtering algorithms.

  • For this particular dataset, a popularity-based system is not a good choice, because it is non-personalized and gives the same recommendations to every user. Since we are working in the e-commerce industry, recommending identical items to all users is not a reliable approach.

  • On the other hand, collaborative filtering based on SVD shows good predictions with a low RMSE (0.0033), so we can consider it the better model.

Thank you